You will have access to the code, data, slides, and training materials via the training page and ARSET GitHub. I highly recommend that you watch my demonstration and then explore the code after the training has concluded. I have pre-computed some results for the sake of time, so if you try to follow along live you will likely fall behind.
Today we will be looking at supervised classification by creating a random forest model from training data.
We will begin by loading all of the libraries we will use in this training. This is the same set we used in the code for Part 1. We should also set the random seed so that we always get the same results from random (or pseudo-random) operations.
library(randomForest)
library(randomForestExplainer)
library(tidyverse)
library(terra)
library(tidyterra)
library(patchwork)
set.seed(20260226)
Just as we did in Part 1, we need to load the raster data for 2017 and 2024. I’m also going to go ahead and make RGB plots for both years so we have them ready if we need to use them as base maps later. I won’t go over this since you’ve seen the process before. You can refer to the comments in the code block for a quick refresher.
# Read the multiband rasters for 2017 and 2024
y17 <- rast('data/rast/HLS_2017_multiband.tif') %>%
  RGB(c(4,3,2)) # Assign the RGB bands per the HLS band numbering system
y24 <- rast('data/rast/HLS_2024_multiband.tif') %>%
  RGB(c(4,3,2)) # The RGB bands are 4, 3, and 2 since band 1 is Coastal Aerosol
# Create a spatial raster collection containing the rasters for both years
hls <- sprc(y17, y24)
# Create and store a plot for each of 2017 and 2024
p17 <- ggplot() +
  geom_spatraster_rgb(data = hls[1], stretch = 'hist') +
  ggtitle('2017') +
  theme_void()
p24 <- ggplot() +
  geom_spatraster_rgb(data = hls[2], stretch = 'hist') +
  ggtitle('2024') +
  theme_void()
We can now load the training points I have already prepared for this training. The points are associated with the image from 2017 since we will build our model on the 2017 data before applying it to the 2024 data.
I also apply the na.omit function to remove any points
that are missing a class label. The training set is small enough that we
could do a manual check for missing class labels, but this is good
practice for cleaning any training data.
tpts <- vect('data/vect/training_points_2017.gpkg') # These are points on the 2017 data
tpts <- na.omit(tpts) # Remove any points without data
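If you have not used na.omit before, it behaves the same way on an ordinary data frame, which makes it easy to see what it does. Here is a quick sketch on made-up values (illustrative only, not from our training set):

```r
# Toy data frame with one missing class label
df <- data.frame(lc_class = c("water", NA, "forest"),
                 Red = c(0.07, 0.01, 0.02))
clean <- na.omit(df) # Drops any row containing an NA
nrow(clean) # 2 rows remain; the unlabeled point is gone
```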
It is useful to take a look at the training points in the context of
the imagery, so let’s plot them on top of the RGB plot for 2017 we
stored earlier. The points should be colored according to their class so
we can confirm they are representing the land cover types we expect. We
are using a variant of the Okabe-Ito1 colors for the points here and setting the
fill aesthetic to match the lc_class variable.
# Set a color ramp
okabe <- c("#D55E00", "#009E73", "#F5C710", "#56B4E9", "#0072B2", "#E69F00", "#CC79A7")
# Plot the points over the data
p17 +
  geom_spatvector(data = tpts,
                  mapping = aes(fill = lc_class),
                  size = 4,
                  shape = 21,
                  color = 'black',
                  alpha = 0.7) +
  scale_fill_discrete("LC Class", type = okabe) +
  NULL
We have a situation where our training points and the underlying
raster are spatially associated, but the training points don’t have any
information about the reflectance values stored in the raster. We can
extract the reflectance values of the raster at each of the training
points using the extract function with the
bind option set to TRUE so that the results
include the reflectance values as well as the landcover class labels. We
also need to ensure that the lc_class variable is stored as a factor
(the R data type for categorical values) or randomForest
will assume we are doing regression and throw an error because
regression on a character vector is undefined.
tpts_refl <- extract(y17, tpts, bind = TRUE)
tpts_refl$lc_class <- as.factor(tpts_refl$lc_class)
Here’s a quick check of our combined training data to make sure everything worked as expected.
head(tpts_refl)
## lc_class Coastal_Aerosol Blue Green Red Red_Edge1 Red_Edge2 Red_Edge3
## 1 water 0.0097 0.0161 0.0497 0.0698 0.0629 0.0165 0.0197
## 2 water 0.0009 -0.0027 0.0010 0.0029 0.0023 0.0025 0.0040
## 3 water -0.0057 -0.0009 0.0179 0.0231 0.0184 0.0006 0.0031
## 4 water 0.0317 0.0353 0.0600 0.0778 0.0695 0.0315 0.0363
## 5 water 0.0021 -0.0004 0.0065 0.0087 0.0069 0.0009 0.0022
## 6 forest 0.0081 0.0078 0.0240 0.0116 0.0407 0.1712 0.2192
## NIR_Broad Water_Vapor Cirrus SWIR1 SWIR2 NIR_Narrow
## 1 0.0125 -0.0169 5e-04 -0.0042 -0.0014 0.0046
## 2 0.0013 -0.0073 7e-04 0.0018 0.0028 0.0032
## 3 -0.0011 -0.0194 5e-04 0.0010 0.0028 -0.0006
## 4 0.0274 -0.0029 3e-04 0.0059 0.0051 0.0223
## 5 -0.0001 -0.0096 5e-04 0.0001 0.0005 0.0004
## 6 0.2336 0.0350 8e-04 0.1079 0.0383 0.2563
We see that lc_class now holds our class labels, stored as the R factor data type after the conversion above, while the reflectance values are still stored as double-precision floating point (numeric) values. With that sorted we are free to move on.
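If you prefer a programmatic check to reading a printout, base R can report the storage type of every column. A minimal sketch on a stand-in data frame (the real check would use as.data.frame(tpts_refl); the values here are invented):

```r
# Stand-in for our training table: one factor label column, one numeric band
demo <- data.frame(lc_class = as.factor(c("water", "forest")),
                   Red = c(0.0698, 0.0116))
sapply(demo, class) # Reports "factor" for lc_class and "numeric" for Red
is.factor(demo$lc_class) # TRUE: safe to hand to randomForest for classification
```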
There are a few important elements to the randomForest
function call. We will go ahead and run the function here and then
explain what each part is doing.
rf17 <- randomForest(formula = lc_class ~ .,
                     data = tpts_refl,
                     proximity = T,
                     importance = T,
                     ntree = 100,
                     do.trace = F)
The following is a breakdown of the arguments to the
randomForest function call. Note that these are not all of
the possible arguments, just the ones that I think are core to
understanding RF and/or simply practical to know:
formula =
A formula describing the relationship between the objective and predictors
lc_class ~ . reads as "lc_class is predicted (~) by all of the variables (.) in data"
lc_class is excluded since it is the target of the prediction (class ~ predictors)
The . just stands in for "everything else"
data =
The data from which to draw all class and value information
Must contain labels as factors for classification
proximity = T
This value is closer to 1 when the pair of observations are more similar2
We will look at how to visualize this in the next part
Set to F to turn this off
importance = T
Assesses the importance of each predictor; this is what makes the importance() and varImpPlot() results shown later possible
Set to F to turn this off
ntree = 100
The number of trees to grow in the forest
do.trace = F
If set to 1 or T/TRUE, returns verbose information for each tree built
For any n > 1, gives the verbose information for every n trees
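The formula interface is plain base R, so we can poke at it without any modeling package. A short sketch showing how the . expands against a data frame's columns (the band names here are invented for illustration):

```r
f <- lc_class ~ . # The same formula we passed to randomForest
class(f) # "formula"
# terms() expands the dot using the columns of a data frame,
# excluding the response variable on the left-hand side
df <- data.frame(lc_class = factor(c("a", "b")), Red = c(1, 2), NIR = c(3, 4))
attr(terms(f, data = df), "term.labels") # "Red" "NIR"
```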
We now have a trained RF model which describes the classification process. To return to the example in the slides, you can think of this as being equivalent to the decision tree with all of the Yes/No questions filled out.
Importantly, the model object is not a raster that
we can plot. In order to get from the model object to a raster output we
have to apply the trained model using the predict function.
Since the model is already trained we can apply it to both the 2017 and
2024 imagery.
pred17 <- predict(hls[1], rf17)
pred24 <- predict(hls[2], rf17)
Now that we have land cover maps for both 2017 and 2024 we can create a change matrix which shows how many pixels of each class in 2017 became pixels of the same or a different class in 2024.
To do this, we will join the single-band rasters for each year into a multi-band raster and use the crosstab function.
I am going to do this two slightly different ways. The first will return a matrix-style table, whereas the second will return the same information in a “long” format which is more conducive to sorting and a bit easier to read.
change_mat <- c(pred17, pred24) %>%
  crosstab()
names(dimnames(change_mat)) <- c('class_2017', 'class_2024')
change_mat_long <- c(pred17, pred24) %>%
  crosstab(long = T) %>%
  rename(class_2017 = class,
         class_2024 = class.1)
First we will look at the change matrix in the typical form. In this case the rows correspond to the predicted land cover class in 2017 and the columns correspond to the predicted land cover class in 2024.
change_mat
## class_2024
## class_2017 barren forest veg water
## barren 478020 23045 9752 7347
## forest 1501442 5821955 1365912 42142
## veg 793919 230842 160920 11004
## water 1487491 90207 168047 1202235
This may look a bit complicated at first, but let’s just consider the changes in pixels that were classified as water in 2017.
barren forest veg water
water 1487491 90207 168047 1202235
Here we see that of the pixels classified as water in 2017:
1,487,491 were later classified as barren
90,207 were later classified as forest
168,047 were later classified as vegetation
1,202,235 remained classified as water
Notice that when something is classified as a change from water to water we can say instead that those pixels did not change, and all of these A:A pairs lie along the diagonal.
Since the diagonal of the matrix represents the number of pixels which did not change (e.g. a “change” from barren to barren) and the sum across the rows represents the total number of pixels for that class in 2017, we can very easily calculate the percentage of each class that remained unchanged.
# Divide the unchanged (diagonal) by the 2017 total (row sums)
diag(change_mat) / rowSums(change_mat)
## barren forest veg water
## 0.9225265 0.6667798 0.1344715 0.4078165
We see immediately that the barren class changed the least as 92.25% of the pixels classified as barren in 2017 were still classified as barren in 2024. Vegetation saw the most change as only 13.45% of the pixels classified as vegetation in 2017 were still classified as vegetation in 2024.
Referring back to the change matrix we may also notice that the 2017 vegetation pixels most commonly changed to barren in 2024 (793,919 pixels). With only a bit more math we can calculate that 66.34% of the vegetation pixels became barren pixels.
change_mat['veg', 'barren'] / sum(change_mat['veg', ])
## [1] 0.6634319
Notice that this matrix format gives us a handy
change_mat[ 'from_class', 'to_class' ] syntax for isolating
specific changes between pairs of classes.
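If the diag()/rowSums() and ['from', 'to'] idioms are new to you, here is the same arithmetic on a tiny hand-made matrix (the counts are invented purely for illustration):

```r
# A 2x2 change matrix: 90 of 100 'A' pixels stayed 'A', 95 of 100 'B' stayed 'B'
m <- matrix(c(90, 10,
               5, 95),
            nrow = 2, byrow = TRUE,
            dimnames = list(from = c("A", "B"), to = c("A", "B")))
diag(m) / rowSums(m) # Fraction of each 'from' class that did not change: 0.90, 0.95
m["A", "B"] / sum(m["A", ]) # Fraction of 'A' that became 'B': 0.10
```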
The long version I mentioned earlier simply takes all of the information in the change matrix and stretches it out such that each possible "from_class"/"to_class" pair is listed individually. This is particularly useful if you want to filter and sort your data to get the magnitude of change.
change_mat_long %>%
  filter(class_2017 != class_2024) %>% # Filter to remove the diagonal (non-changes)
  arrange(desc(n)) # Sort by descending count
## class_2017 class_2024 n
## 4 forest barren 1501442
## 10 water barren 1487491
## 5 forest veg 1365912
## 7 veg barren 793919
## 8 veg forest 230842
## 12 water veg 168047
## 11 water forest 90207
## 6 forest water 42142
## 1 barren forest 23045
## 9 veg water 11004
## 2 barren veg 9752
## 3 barren water 7347
You could summarize the data further, such as when you’re looking for the total loss/gain of a particular LC class.
veg_gain <- change_mat_long %>%
  filter(class_2024 == 'veg' & class_2017 != 'veg') %>% # Is veg now AND wasn't veg before
  select(n) %>% # Select the counts
  sum() # Sum of the counts
veg_loss <- change_mat_long %>%
  filter(class_2024 != 'veg' & class_2017 == 'veg') %>% # Isn't veg now AND was veg before
  select(n) %>% # Select the counts
  sum() # Sum of the counts
data.frame(class = 'veg',
           loss = veg_loss,
           gain = veg_gain,
           ratio = veg_loss/veg_gain)
## class loss gain ratio
## 1 veg 1035765 1543711 0.6709578
Now we can see that we've isolated the overall changes in vegetation, measured the magnitude of both gain and loss, and determined that vegetation lost only about 67.1% as many pixels as it gained. To put that another way, even though vegetation experienced the greatest change overall, it gained roughly 1.5 pixels for every pixel lost. We can therefore say that the vegetation class expanded between 2017 and 2024 overall, but areas which were previously vegetation did not tend to remain vegetation.
The change matrices give us a great quantitative insight into the magnitude and direction of change over time, but it is also important to understand where these changes occur in space. We will return to the slides briefly where I will lay out the process for turning all of this information into a map of LCLUC.
Just as we did in Part 1 of this training, I am going to use a plotting helper function so that all of our plots use a consistent layout and legend. This is very similar to the plotting helper function in Part 1 so I won’t go over it in detail.
rf_predicted_raster_plot <- function(predicted_raster) {
  valid_lc_classes <- unique(predicted_raster$class)
  ggplot() +
    geom_spatraster(data = predicted_raster) +
    scale_fill_discrete(name = 'Class',
                        labels = valid_lc_classes,
                        palette = okabe,
                        na.value = 'black') +
    theme_void()
}
I will now use the plotting helper function to create plots of each of the classified rasters and arrange them side-by-side.
p_pred17 <- rf_predicted_raster_plot(pred17)
## <SpatRaster> resampled to 501264 cells.
p_pred24 <- rf_predicted_raster_plot(pred24)
## <SpatRaster> resampled to 501264 cells.
(p_pred17 | p_pred24)
In the 2017 image on the left we see the large extent of flooded areas outside the main course of the river which are replaced by the barren class in the 2024 image on the right. We also see some significant expansion of vegetation in parts of the forested area to the south of the river. This is giving us visual clues about the strange metrics regarding vegetation we were seeing in the change matrix. Although very few pixels which were vegetation in 2017 remain vegetation in 2024, the expansion of vegetation in the forested areas explains why there was still an overall gain in the vegetation class.
In the slides I mentioned that all we need to do in order to create a
binary change mask is to use the != logical inequality
operator. Let’s do that now and then plot the results.
change_mask <- pred17 != pred24 # Is the pixel different between 2017 and 2024?
ggplot() +
  geom_spatraster(data = change_mask) +
  scale_fill_discrete("LC Changed", # Name the legend
                      palette = c('grey20', 'grey90'), # False is dark, True is light
                      na.translate = F) + # There are no NA values, so remove NA from the legend
  theme_void() # Remove axis ticks
## <SpatRaster> resampled to 501264 cells.
There we have the binary change mask exactly as I demonstrated in the slides with the small model. Following exactly the same procedures, we might be interested in applying this mask back to one or both of the previous classified LC maps, which is as simple as multiplying the original map with this new change mask. However, the terra package gives us a much cleaner way of achieving the same result using the mask function and specifying that we want our mask value to be 0 (False).
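The multiplication trick described above is easy to see with plain matrices before handing the real rasters to terra. A toy sketch (the class codes and mask values are invented):

```r
lc <- matrix(c(1, 2,
               3, 1), nrow = 2, byrow = TRUE) # Toy land cover class codes
changed <- matrix(c(0, 1,
                    1, 0), nrow = 2, byrow = TRUE) # Toy 0/1 change mask
lc * changed # Unchanged pixels become 0; changed pixels keep their class code
```

terra's mask() with maskvalues = 0 achieves the same separation but sets the unchanged pixels to NA rather than 0, which is friendlier for plotting.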
pred17_change <- mask(pred17, change_mask, maskvalues = 0)
pred24_change <- mask(pred24, change_mask, maskvalues = 0)
Now we can plot both maps using the same helper function we used earlier.
p_pred17_chg <- rf_predicted_raster_plot(pred17_change)
## <SpatRaster> resampled to 501264 cells.
p_pred24_chg <- rf_predicted_raster_plot(pred24_change)
## <SpatRaster> resampled to 501264 cells.
p_pred17_chg | p_pred24_chg
By only highlighting the pixels that experienced change between 2017 and 2024, we can get a better idea of the locations where certain changes took place. If we think of these maps in terms of the change matrix, we have essentially visualized everything except for the diagonal of the matrix, which is now NA. We can now see in stark contrast the loss of forest to vegetation and the loss of water to barren as well as where in the scene these changes are occurring. That map on the right is the same one I had promised to show you how to make at the beginning of this presentation.
So, for one final time, we’re going to return to the slides for a summary and conclusion, and then we will open up the Q&A session.
Note: The following section gives some extra information and examples to aid in your understanding of RF models but is not covered as part of the training.
There are a few existing methods for looking at some of the model
performance metrics such as the print and
importance methods for random forest model objects and the
variable importance plot varImpPlot. This is what those
look like:
print(rf17)
##
## Call:
## randomForest(formula = lc_class ~ ., data = tpts_refl, proximity = T, importance = T, ntree = 100, do.trace = F)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## barren forest veg water class.error
## barren 7 0 0 0 0
## forest 0 9 0 0 0
## veg 0 0 8 0 0
## water 0 0 0 7 0
importance(rf17)
## barren forest veg water MeanDecreaseAccuracy
## Coastal_Aerosol 2.954400 2.2454436 0.3420817 -0.08421818 2.752780
## Blue 3.092208 2.2681069 1.8784902 1.41392504 3.120927
## Green 3.397263 4.0633339 2.6307237 2.23114062 4.574459
## Red 3.264720 1.1769375 2.8294085 -0.20004001 3.275475
## Red_Edge1 4.429112 3.3694703 1.4839650 2.68522378 4.665398
## Red_Edge2 2.023093 3.7683045 4.6621334 3.75105792 5.272667
## Red_Edge3 2.794713 3.4882858 4.8918334 4.63778755 5.622194
## NIR_Broad 1.740777 3.6275216 4.2179300 3.69791240 5.166713
## Water_Vapor 2.424339 2.5636484 3.3333333 3.29394505 4.080353
## Cirrus 2.731953 -0.2132492 0.0000000 2.61921094 3.382003
## SWIR1 4.332331 3.7006054 3.0045203 4.30398958 5.967464
## SWIR2 4.288362 2.8092561 2.0657071 4.14758579 5.140733
## NIR_Narrow 2.709057 3.0974316 2.9977303 3.84889283 4.697426
## MeanDecreaseGini
## Coastal_Aerosol 0.7035641
## Blue 0.9414973
## Green 1.6053586
## Red 0.9846855
## Red_Edge1 1.8067434
## Red_Edge2 2.4043788
## Red_Edge3 2.6898228
## NIR_Broad 1.9429003
## Water_Vapor 1.3828927
## Cirrus 0.9058164
## SWIR1 2.7725011
## SWIR2 2.1303193
## NIR_Narrow 2.0914550
varImpPlot(rf17)
However, there are far more powerful evaluation metrics that don’t require much extra work.
The package {randomForestExplainer}, which we loaded at the
start of this document, provides a wide array of functions to help
you understand the characteristics of your RF models. There is even a
function called explain_forest which wraps all of these
results into a nice html report. Below I have given just one example of
a particularly useful plot from the package which shows the importance
of each variable (band) in the forest.
rf17 %>%
  measure_importance() %>%
  plot_multi_way_importance()
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the randomForestExplainer package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Our ability to apply a single model to two different scenes requires us to operate under the assumption that the reflectance characteristics of each LC class remain constant over time. This is not strictly true in the real world. For example, we know that the leaves of trees change appearance as a response to things like soil moisture and seasonality. The model is robust to these differences only if the training data captures the variability of real-world conditions.
Ideally, training data with the same class label would have a wide range of correlation values on the high end (roughly 75% or greater) without any observation being more closely correlated to training data observations of a different class. That is to say that we would prefer to maximize the variability within a class while minimizing the similarity between classes.
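To make that concrete, here is a base-R sketch of the within-class versus between-class comparison on a tiny invented proximity matrix for four observations, two labelled water and two labelled veg (the values are made up; the real matrix lives in rf17$proximity):

```r
prox <- matrix(c(1.0, 0.9, 0.2, 0.1,
                 0.9, 1.0, 0.3, 0.2,
                 0.2, 0.3, 1.0, 0.8,
                 0.1, 0.2, 0.8, 1.0), nrow = 4, byrow = TRUE)
cls <- c("water", "water", "veg", "veg")
same <- outer(cls, cls, "==") # TRUE where a pair shares a class label
offdiag <- !diag(length(cls)) # Drop each observation's self-comparison
mean(prox[same & offdiag]) # Within-class similarity: 0.85 (want this high)
mean(prox[!same]) # Between-class similarity: 0.2 (want this low)
```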
We can visualize the pairwise similarity for our training data using the proximity matrix from our RF model object. That lets us get a quick sense of which training points might be confused with another class (high cross-class correlation) and which classes have too little variability (all observations within the class have similarity ~1).
rf17_long <- as.data.frame(rf17$proximity) %>% # Coerce the proximity matrix into a df object
  rownames_to_column(var = "yax") %>% # The row names (obs numbers) become a named column
  pivot_longer( # Coerce the data into a long format where each obs pair is a single row
    cols = -yax, # Ignore the observation number in reshaping
    names_to = "xax", # Names become x-axis labels
    values_to = "corr") %>% # Remaining values are called "corr" for correlation
  mutate(
    xvar = factor(xax, levels = 1:nrow(rf17$proximity)), # One level per training observation
    yvar = factor(yax, levels = 1:nrow(rf17$proximity)) # Keeps the axes in observation order
  )
ggplot(rf17_long, mapping = aes(x = xvar, y = yvar, fill = corr)) +
  geom_tile() + # Display as heat map matrix (tile geometry)
  scale_fill_grass_c() + # Use the GRASS-style viridis color table
  labs(x = "Training Class",
       y = "Training Class",
       fill = "Similarity") + # Add labels and guide
  coord_equal() + # Force square matrix layout
  scale_y_discrete(labels = tpts$lc_class) + # Label each tpts obs by class
  scale_x_discrete(labels = tpts$lc_class) + # Ditto
  theme(
    axis.text.x = element_text( # Rotate and align the x-axis labels
      angle = 90,
      vjust = 0.5,
      hjust = 1)
  )
Okabe, M., & Ito, K. (2008). Color universal design (CUD): How to make figures and presentations that are friendly to colorblind people.↩︎
Really this is the percentage of the random trees where two observations ended up in the same node, so it should be higher for observations that share a class and 1 (100%) along the diagonal↩︎
I tried this with 500, 200, and 100 trees and there isn’t any appreciable gain.
Our sample is very small relative to the study area and not truly representative. The study area and target classes are also intentionally simplified↩︎
Suffice to say that increasing ntree (or tweaking other parameters which increase computational complexity) does not always improve the accuracy of results. A parsimonious model is almost always preferable to a more complex one.
The optimal selection of parameters is both fascinating as an optimization problem and as a philosophical question. Unfortunately neither of those are conversations to be had in this training.↩︎